Despite their success, understanding how convolutional neural networks (CNNs) efficiently learn high-dimensional functions remains a fundamental challenge. A popular belief is that these models exploit the compositional and hierarchical structure of natural data such as images. Yet we lack a quantitative understanding of how such structure affects performance, e.g. the rate of decay of the generalization error with the number of training samples. In this paper we study deep CNNs in the kernel regime: i) we show that the spectrum of the corresponding kernel and its asymptotics inherit the hierarchical structure of the network; ii) we use generalization bounds to prove that deep CNNs adapt to the spatial scale of the target function; iii) we illustrate this result by computing the rate of decay of the error in a teacher-student setting, where a deep CNN is trained on the output of another deep CNN with randomly initialized parameters. We find that if the teacher function depends only on certain low-dimensional subsets of the input variables, then the rate is controlled by the effective dimensionality of these subsets. Conversely, if the teacher function depends on the full set of input variables, then the error rate is inversely proportional to the input dimension. Interestingly, this implies that, despite their hierarchical structure, the functions generated by deep CNNs are too rich to be learned efficiently in high dimension.
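In hedged form, the two regimes described above can be summarized as a power-law decay of the generalization error with the number of training samples $P$; the notation $\beta$, $d_{\mathrm{eff}}$ is ours, and the paper's exact exponents may differ by constants:

$$\epsilon(P) \sim P^{-\beta}, \qquad \beta \propto \frac{1}{d_{\mathrm{eff}}} \ \text{(teacher on low-dimensional subsets)}, \qquad \beta \propto \frac{1}{d} \ \text{(teacher on all $d$ inputs)}.$$

The second case expresses the curse of dimensionality that persists despite the hierarchical architecture.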
It is widely believed that the success of deep networks lies in their ability to learn a meaningful representation of the features of the data. Yet understanding when and how this feature learning improves performance remains a challenge: for example, it is beneficial for modern architectures trained to classify images, whereas it is detrimental for fully-connected networks trained on the same data for the same task. Here we propose an explanation of this puzzle, by showing that feature learning can perform worse than lazy training (via the random feature kernel or the NTK), as the former can lead to a sparser neural representation. Although sparsity is known to be essential for learning anisotropic data, it is detrimental when the target function is constant or smooth along certain directions of the input space. We illustrate this phenomenon in two settings: (i) regression of Gaussian random functions on the $d$-dimensional unit sphere, and (ii) classification of benchmark image datasets. For (i), we compute the scaling of the generalization error with the number of training points and show that methods that do not learn features generalize better, even when the dimension of the input space is large. For (ii), we show empirically that learning features can indeed lead to sparse, and thereby less smooth, representations of the image predictors. This fact is plausibly responsible for the deterioration of performance, which is known to correlate with smoothness along diffeomorphisms.
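A minimal sketch of the "lazy" baseline discussed above: ridge regression on fixed random ReLU features, which approximates an NTK-like kernel method. The function name, defaults, and feature map are our illustrative assumptions, not the paper's exact setup; the point is that the features stay frozen, so no feature learning occurs.

```python
import numpy as np

def lazy_predict(X_train, y_train, X_test, n_features=500, ridge=1e-8, seed=0):
    """Ridge regression on frozen random ReLU features (lazy training sketch)."""
    rng = np.random.default_rng(seed)
    d = X_train.shape[1]
    W = rng.standard_normal((d, n_features)) / np.sqrt(d)  # frozen random weights
    phi = lambda X: np.maximum(X @ W, 0.0)                 # random ReLU feature map
    A = phi(X_train)
    # Ridge-regularized normal equations; only the linear readout is fit.
    coef = np.linalg.solve(A.T @ A + ridge * np.eye(n_features), A.T @ y_train)
    return phi(X_test) @ coef
```

With more random features than training points and a tiny ridge, the predictor essentially interpolates the training data, mimicking ridgeless kernel regression.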
Reinforcement learning is typically difficult when the agent's observations are partial or noisy, yielding a partially observable Markov decision process (POMDP). To seek good performance in a POMDP, one strategy is to endow the agent with a finite memory, whose update is governed by the policy. However, policy optimization is then non-convex and can lead to poor training performance from random initialization. Performance can be empirically improved by constraining the memory architecture, at the price of sacrificing optimality to facilitate training. Here we study this trade-off in two hypothesis-testing problems, akin to the two-armed bandit problem. We compare two extreme cases: (i) a random-access memory, where any transition among $M$ memory states is allowed, and (ii) a fixed memory, where the agent has access to its last $m$ actions and rewards. For (i), the probability $q$ of playing the worst arm is exponentially small in $M$ for the optimal policy. Our main result is to show that a comparable performance is attained despite the simplified memory architecture of (ii), namely $q \sim \alpha^{2^m}$ for some $\alpha < 1$. Moreover, we observe empirically that training from random initialization leads to very poor results for (i), and to significantly better results for (ii) thanks to the constraints on the memory architecture.
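A toy simulation of setting (ii) above: a two-armed bandit agent whose state is its last $m$ (action, reward) pairs, with a policy mapping that memory to an arm. All names, the example win-stay/lose-shift policy, and the arm probabilities are our illustrative assumptions, not the paper's protocol.

```python
import random

def run_finite_memory_agent(policy, arm_probs, m, steps, seed=0):
    """Simulate a two-armed bandit agent with a sliding window of the
    last m (action, reward) pairs; returns pull counts per arm."""
    rng = random.Random(seed)
    memory = ()            # last m (action, reward) pairs, most recent last
    pulls = [0, 0]
    for _ in range(steps):
        a = policy(memory)
        r = 1 if rng.random() < arm_probs[a] else 0
        memory = (memory + ((a, r),))[-m:]
        pulls[a] += 1
    return pulls

def win_stay_lose_shift(memory):
    """Example fixed-memory policy using only the most recent pair."""
    if not memory:
        return 0
    last_action, last_reward = memory[-1]
    return last_action if last_reward == 1 else 1 - last_action
```

Even this one-pair memory concentrates play on the better arm, illustrating how a constrained memory architecture can still support a reasonable policy.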
Convolutional neural networks perform a local and translationally-invariant treatment of the data: quantifying which of these two aspects is central to their success remains a challenge. We study this problem within a teacher-student framework for kernel regression, using heuristic kernels inspired by the neural tangent kernel of simple convolutional architectures of given filter size. Using heuristic methods from physics, we find in the ridgeless case that locality is key in determining the learning curve exponent $\beta$ (which relates the test error $\epsilon_t \sim p^{-\beta}$ to the size of the training set $p$), whereas translational invariance is not. In particular, if the filter size of the teacher is smaller than that of the student $s$, then $\beta$ is a function of $s$ only and does not depend on the input dimension. We confirm our predictions for $\beta$ empirically. We conclude by arguing, based on a natural universality assumption, that performing kernel regression with a ridge that decreases with the size of the training set leads to learning curve exponents similar to those we obtain in the ridgeless case.
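The learning curve exponent $\beta$ in $\epsilon_t \sim p^{-\beta}$ is typically estimated by a least-squares fit in log-log coordinates. A minimal sketch of that standard procedure (not necessarily the paper's exact fitting protocol):

```python
import numpy as np

def learning_curve_exponent(p_values, test_errors):
    """Estimate beta in eps_t ~ p^(-beta) by fitting a line to
    log(test_error) vs log(p); beta is minus the fitted slope."""
    slope, _intercept = np.polyfit(np.log(p_values), np.log(test_errors), 1)
    return -slope
```

For an exact power law the fit recovers the exponent exactly; in practice one averages test errors over several training sets at each $p$ before fitting.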
Understanding why deep networks can classify data in large dimensions remains a challenge. It has been proposed that they do so by becoming stable to diffeomorphisms, yet existing empirical measurements support that this is often not the case. We revisit this question by defining a maximum-entropy distribution on diffeomorphisms, which allows us to study typical diffeomorphisms of a given norm. We confirm that stability toward diffeomorphisms does not strongly correlate with performance on benchmark datasets. By contrast, we find that the stability toward diffeomorphisms relative to that toward generic transformations, $R_f$, correlates remarkably with the test error $\epsilon_t$. It is of order unity at initialization but decreases by several decades during training for state-of-the-art architectures. For CIFAR10 and 15 known architectures, we find $\epsilon_t \approx 0.2\sqrt{R_f}$, suggesting that obtaining a small $R_f$ is important to achieve good performance. We study how $R_f$ depends on the size of the training set and compare it to a simple model of invariant learning.
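A sketch of how a relative-stability ratio like $R_f$ can be estimated: compare how much a function $f$ moves under smooth deformations versus under generic isotropic noise of matched norm. The deformation generator, sample count, and function names are our illustrative assumptions, not the paper's exact estimator.

```python
import numpy as np

def relative_stability(f, x, smooth_deform, rng, n_samples=200):
    """Ratio of f's mean squared change under smooth deformations to its
    mean squared change under norm-matched isotropic noise."""
    num = den = 0.0
    for _ in range(n_samples):
        dx = smooth_deform(x, rng)                        # deformation-like shift
        eta = rng.standard_normal(x.shape)
        eta *= np.linalg.norm(dx) / np.linalg.norm(eta)   # match perturbation norm
        num += (f(x + dx) - f(x)) ** 2
        den += (f(x + eta) - f(x)) ** 2
    return num / den
```

A ratio well below one means $f$ is much more stable to smooth deformations than to generic perturbations of the same size.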
Mean-field games have been used as a theoretical tool to obtain an approximate Nash equilibrium for symmetric and anonymous $N$-player games in the literature. However, existing theoretical results assume variations of a "population generative model", which allows arbitrary modifications of the population distribution by the learning algorithm and thus limits their applicability. Instead, we show that $N$ agents running policy mirror ascent converge to the Nash equilibrium of the regularized game within $\tilde{\mathcal{O}}(\varepsilon^{-2})$ samples from a single sample trajectory, without a population generative model, up to a standard $\mathcal{O}(\frac{1}{\sqrt{N}})$ error due to the mean field. Departing from the literature, instead of working with the best-response map, we first show that a policy mirror ascent map can be used to construct a contractive operator having the Nash equilibrium as its fixed point. Next, we prove that conditional TD-learning in $N$-agent games can learn value functions within $\tilde{\mathcal{O}}(\varepsilon^{-2})$ time steps. These results allow us to prove sample complexity guarantees in the oracle-free setting by relying only on a sample path from the $N$-agent simulator. Furthermore, we demonstrate that our methodology allows for independent learning by $N$ agents with finite sample guarantees.
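The core update behind policy mirror ascent with a KL divergence is the exponentiated-gradient step $\pi'(a \mid s) \propto \pi(a \mid s)\exp(\eta\, Q(s,a))$. A generic sketch of that single step; the paper's actual operator also involves regularization and mean-field terms that we omit here:

```python
import numpy as np

def policy_mirror_ascent_step(policy, q_values, eta):
    """One KL-mirror (exponentiated-gradient) policy update.
    policy, q_values: arrays of shape (n_states, n_actions)."""
    unnormalized = policy * np.exp(eta * q_values)
    return unnormalized / unnormalized.sum(axis=1, keepdims=True)
```

Each row remains a probability distribution, and probability mass shifts toward actions with higher $Q$-values at a rate controlled by the step size $\eta$.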
Foundation models are redefining how AI systems are built. Practitioners now follow a standard procedure to build their machine learning solutions: download a copy of a foundation model, and fine-tune it using some in-house data about the target task of interest. Consequently, the Internet is swarmed by a handful of foundation models fine-tuned on many diverse tasks. Yet, these individual fine-tunings often lack strong generalization and exist in isolation without benefiting from each other. In our opinion, this is a missed opportunity, as these specialized models contain diverse features. Based on this insight, we propose model recycling, a simple strategy that leverages multiple fine-tunings of the same foundation model on diverse auxiliary tasks, and repurposes them as rich and diverse initializations for the target task. Specifically, model recycling fine-tunes in parallel each specialized model on the target task, and then averages the weights of all target fine-tunings into a final model. Empirically, we show that model recycling maximizes model diversity by benefiting from diverse auxiliary tasks, and achieves a new state of the art on the reference DomainBed benchmark for out-of-distribution generalization. Looking forward, model recycling is a contribution to the emerging paradigm of updatable machine learning where, akin to open-source software development, the community collaborates to incrementally and reliably update machine learning models.
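The final step of model recycling, averaging the weights of all target fine-tunings, can be sketched as follows. The function operates on plain parameter dictionaries (as in a framework's `state_dict`); uniform averaging is the strategy the abstract describes, while the helper name is ours.

```python
def average_state_dicts(state_dicts):
    """Uniformly average several models' parameters. All dicts must share
    the same architecture, hence identical keys and shapes."""
    keys = state_dicts[0].keys()
    n = len(state_dicts)
    return {k: sum(sd[k] for sd in state_dicts) / n for k in keys}
```

Averaging is only meaningful here because every fine-tuning starts from the same foundation model, so the weights stay in a shared region of parameter space.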
One of the major challenges of machine translation (MT) is ambiguity, which can in some cases be resolved by accompanying context such as an image. However, recent work in multimodal MT (MMT) has shown that obtaining improvements from images is challenging, limited not only by the difficulty of building effective cross-modal representations but also by the lack of specific evaluation and training data. We present a new MMT approach based on a strong text-only MT model, which uses neural adapters and a novel guided self-attention mechanism and which is jointly trained on both visual masking and MMT. We also release CoMMuTE, a Contrastive Multilingual Multimodal Translation Evaluation dataset, composed of ambiguous sentences and their possible translations, accompanied by disambiguating images corresponding to each translation. Our approach obtains competitive results over strong text-only models on standard English-to-French benchmarks and outperforms these baselines and state-of-the-art MMT systems by a large margin on our contrastive test set.
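Neural adapters of the kind mentioned above are typically small bottleneck modules inserted into a pretrained model. A generic sketch (down-project, ReLU, up-project, residual); this is the standard adapter pattern, not the paper's exact module:

```python
import numpy as np

def bottleneck_adapter(h, W_down, W_up):
    """Apply a bottleneck adapter to hidden states h of shape (seq, dim):
    project to a small dimension, apply ReLU, project back, add residual."""
    return h + np.maximum(h @ W_down, 0.0) @ W_up
```

Because of the residual connection, initializing `W_up` to zeros makes the adapter an identity map, so the pretrained text-only model's behavior is preserved at the start of multimodal training.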
Recent developments of advanced driver-assistance systems necessitate an increasing number of tests to validate new technologies. These tests cannot be carried out on track in a reasonable amount of time, and automotive groups rely on simulators to perform most tests. The reliability of these simulators for constantly refined tasks is becoming an issue, and, to increase the number of tests, the industry is now developing surrogate models that mimic the behavior of the simulator while being much faster to run on specific tasks. In this paper we aim to construct a surrogate model to mimic and replace the simulator. We first test several classical methods such as random forests, ridge regression, and convolutional neural networks. Then we build three hybrid models that use all of these methods and combine them to obtain an efficient hybrid surrogate model.
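One simple way to combine base surrogates (e.g. a random forest, a ridge regression, and a CNN) into a hybrid model is to weight each by its inverse validation MSE. The weighting scheme and names below are our assumption for illustration; the paper builds three hybrids whose exact combination rules may differ:

```python
import numpy as np

def hybrid_combine(val_preds, y_val, test_preds):
    """Blend base-model test predictions with weights proportional to
    each model's inverse MSE on a held-out validation set."""
    mse = np.array([np.mean((p - y_val) ** 2) for p in val_preds])
    w = 1.0 / (mse + 1e-12)   # small constant guards against zero MSE
    w /= w.sum()
    return sum(wi * p for wi, p in zip(w, test_preds))
```

A model that is accurate on validation data dominates the blend, while poor models are automatically down-weighted.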
We introduce submodel co-training, a regularization method related to co-training, self-distillation and stochastic depth. Given a neural network to be trained, for each sample we implicitly instantiate two altered networks, ``submodels'', with stochastic depth: we activate only a subset of the layers. Each network serves as a soft teacher to the other, by providing a loss that complements the regular loss provided by the one-hot label. Our approach, dubbed cosub, uses a single set of weights, and does not involve a pre-trained external model or temporal averaging. Experimentally, we show that submodel co-training is effective to train backbones for recognition tasks such as image classification and semantic segmentation. Our approach is compatible with multiple architectures, including RegNet, ViT, PiT, XCiT, Swin and ConvNext. Our training strategy improves their results in comparable settings. For instance, a ViT-B pretrained with cosub on ImageNet-21k obtains 87.4% top-1 acc. @448 on ImageNet-val.
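A cosub-style objective can be sketched as a cross-entropy term on the one-hot label plus a symmetric soft term in which each submodel supervises the other. The mixing weight `lam` and the exact form of the soft loss are our assumptions; the paper's formulation may differ in details:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cosub_loss(logits_a, logits_b, onehot, lam=0.5):
    """Hard cross-entropy for both submodels plus a symmetric
    soft-teacher term between their predictions."""
    pa, pb = softmax(logits_a), softmax(logits_b)
    ce = -(onehot * np.log(pa)).sum() - (onehot * np.log(pb)).sum()
    # In practice each teacher distribution is detached from the gradient graph.
    soft = -(pb * np.log(pa)).sum() - (pa * np.log(pb)).sum()
    return (1.0 - lam) * ce + lam * soft
```

The two submodels share a single set of weights and differ only through stochastic depth, so this loss regularizes one network against its own randomly thinned variants.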